Client Report - When Was It Built? ML Analysis

Course: DS 250

Author: Ben Jacobs

When Was it Built?

Did you know that a house’s living area and selling price can help predict when it was built?

‘Dwellings’ is a dataset containing information about housing in Denver. Using this dataset, can we train a machine learning algorithm to correctly classify houses as built before or after 1980, and which factors matter most? Let’s build a classification model to find out.

Here is a short table showing what the data looks like:

Show the code
import pandas as pd
import numpy as np
import altair as alt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from tabulate import tabulate

from sklearn.tree import DecisionTreeClassifier       # Decision tree classifier
from sklearn.ensemble import RandomForestClassifier   # Random forest classifier
from sklearn.model_selection import train_test_split  # Train/test split helper
from sklearn import metrics                           # Accuracy and other metrics

# Read csv into a pandas object
housing = pd.read_csv('https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv')
Show the code
housing.head()
parcel abstrprd livearea finbsmnt basement yrbuilt totunits stories nocars numbdrm ... arcstyle_THREE-STORY arcstyle_TRI-LEVEL arcstyle_TRI-LEVEL WITH BASEMENT arcstyle_TWO AND HALF-STORY arcstyle_TWO-STORY qualified_Q qualified_U status_I status_V before1980
0 00102-08-065-065 1130 1346 0 0 2004 1 2 2 2 ... 0 0 0 0 0 1 0 1 0 0
1 00102-08-073-073 1130 1249 0 0 2005 1 1 1 2 ... 0 0 0 0 0 1 0 1 0 0
2 00102-08-078-078 1130 1346 0 0 2005 1 2 1 2 ... 0 0 0 0 0 1 0 1 0 0
3 00102-08-081-081 1130 1146 0 0 2005 1 1 0 2 ... 0 0 0 0 0 1 0 1 0 0
4 00102-08-086-086 1130 1249 0 0 2005 1 1 1 2 ... 0 0 0 0 0 0 1 1 0 0

5 rows × 51 columns

What Factors are Most Important?

To help train the algorithm, we need to identify variables in the data that it can use to sort each house. One approach is to find variables that have a strong positive or negative linear correlation with houses being built before 1980. The heatmap below shows each variable’s correlation on a scale from -1 (red) to +1 (blue); we will pick the variables that are the most strongly red or blue.

Show the code
import matplotlib as plt

housing2 = housing.drop(columns=housing.columns[0])  # Drop the first column

corr = housing2.corr()

# Create heatmap plot with Plotly
data = [go.Heatmap(z=[corr['yrbuilt']], colorscale='RdBu', x=housing2.columns)]

# Create subplots
fig = make_subplots(rows=1, cols=1)

# Add heatmap trace to the subplot
fig.add_trace(data[0], row=1, col=1)

# Update layout
fig.update_layout(title_text='Heatmap for yrbuilt', height=600)

# Allow zooming/panning on the y-axis
fig.update_yaxes(fixedrange=False)

# Show plot
fig.show()

From the chart, the following variables have been selected:

Show the code
# Define the variables of interest
variables_of_interest = ['stories', 'numbaths','quality_C', 'gartype_Att','gartype_Det','arcstyle_ONE-STORY','livearea','sprice','status_V',"abstrprd","finbsmnt","nocars","deduct"]

# Calculate the correlations with 'yrbuilt'
correlation_with_yrbuilt = housing2[variables_of_interest].corrwith(housing2['yrbuilt'])

# Create a DataFrame to store the correlations
correlation_table = pd.DataFrame({'Variable': variables_of_interest, 'Correlation with yrbuilt': correlation_with_yrbuilt})

# Display the correlation table
print(correlation_table.to_string(index=False))
          Variable  Correlation with yrbuilt
           stories                  0.306632
          numbaths                  0.383993
         quality_C                 -0.311326
       gartype_Att                  0.440297
       gartype_Det                 -0.429308
arcstyle_ONE-STORY                 -0.418635
          livearea                  0.294126
            sprice                  0.174688
          status_V                  0.301602
          abstrprd                  0.006508
          finbsmnt                 -0.019738
            nocars                  0.098045
            deduct                 -0.055379

Let’s take a closer look at two variables to see visually why they might correlate. We will look at numbaths and status_V, which have positive correlations of 0.38 and 0.30 respectively.

Show the code
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Define traces for each scatter plot
trace2_before1980 = go.Scatter(x=housing2[housing2['before1980'] == 1]['yrbuilt'],
                               y=housing2[housing2['before1980'] == 1]['numbaths'],
                               mode='markers', name='before 1980', marker=dict(color='blue'))

trace2_after1980 = go.Scatter(x=housing2[housing2['before1980'] == 0]['yrbuilt'],
                              y=housing2[housing2['before1980'] == 0]['numbaths'],
                              mode='markers', name='after 1980', marker=dict(color='red'))

trace3_before1980 = go.Scatter(x=housing2[housing2['before1980'] == 1]['yrbuilt'],
                               y=housing2[housing2['before1980'] == 1]['status_V'],
                               mode='markers', name='before 1980', marker=dict(color='blue'))

trace3_after1980 = go.Scatter(x=housing2[housing2['before1980'] == 0]['yrbuilt'],
                              y=housing2[housing2['before1980'] == 0]['status_V'],
                              mode='markers', name='after 1980', marker=dict(color='red'))

# Create subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=['numbaths - yrbuilt', 'status_V - yrbuilt'])

# Add traces to subplots
fig.add_trace(trace2_before1980, row=1, col=1)
fig.add_trace(trace2_after1980, row=1, col=1)

fig.add_trace(trace3_before1980, row=1, col=2)
fig.add_trace(trace3_after1980, row=1, col=2)

# Update layout
fig.update_layout(title_text='Scatter Plot Grid',
                  xaxis=dict(title='yrbuilt'),
                  yaxis=dict(title='numbaths'),
                  xaxis2=dict(title='yrbuilt'),
                  yaxis2=dict(title='status_V'),
                  legend=dict(traceorder='reversed'))

# Show plot
fig.show()

Classification Model

Now that we’ve identified which variables to use, we can train a classification model with scikit-learn.

Note that yrbuilt is removed from the features because it gives the answer away, and before1980 is used only as the target for training and testing.

Show the code
housing2 = housing2.drop(columns=['yrbuilt'])  # Drop the 'giveaway' column by name

feature_cols = ['stories', 'numbaths','quality_C', 'gartype_Att','gartype_Det','arcstyle_ONE-STORY','livearea','sprice','status_V',"abstrprd","finbsmnt","nocars","deduct"]
X = housing2[feature_cols] # Features
y = housing2.before1980 # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
Show the code
clf2 = RandomForestClassifier()

# Train the random forest classifier
model = clf2.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = model.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.9105389482871482

The model achieved over 90% accuracy.
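As a sanity check, k-fold cross-validation can show whether that accuracy is stable across different splits rather than a lucky draw. A minimal sketch (using synthetic stand-in data instead of the dwellings dataset, so the numbers are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing features; the report itself
# would pass X and y built from the dwellings data instead.
X, y = make_classification(n_samples=1000, n_features=13, random_state=1)

clf = RandomForestClassifier(random_state=1)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

If the fold accuracies cluster tightly, the single train/test score above is a fair summary of the model.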

Feature Importances

Let’s take a look at which features we used were most important in the model.

Show the code
import matplotlib.pyplot as plt

# Create a pandas Series with feature names as index
feature_importances_series = pd.Series(model.feature_importances_, index=feature_cols)

# Plot feature importances
feature_importances_series.plot(kind='bar')
plt.title('Feature Importances')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()

Interestingly, a feature’s importance does not necessarily align with how strongly it correlates with the target. For example, status_V had a higher correlation with year built than livearea, yet was the least important feature, while livearea was the most important.
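A side-by-side table makes this kind of mismatch easy to inspect. Below is a minimal sketch on toy data (the column names x1 and x2 are placeholders, not dwellings variables):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: the label depends only on x1; x2 is pure noise.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
df["label"] = (df["x1"] > 0.5).astype(int)

model = RandomForestClassifier(random_state=1).fit(df[["x1", "x2"]], df["label"])

# Correlation with the target vs. importance in the fitted model
comparison = pd.DataFrame({
    "correlation": df[["x1", "x2"]].corrwith(df["label"]),
    "importance": model.feature_importances_,
})
print(comparison)
```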

Using feature importance, I was also able to go back and reselect additional factors that the model found highly important.
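That reselection step can be automated by ranking features by importance and keeping the top few before refitting. A sketch under synthetic data (the names feat_0, feat_1, … are placeholders for feature_cols):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training features
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
feature_names = [f"feat_{i}" for i in range(8)]

model = RandomForestClassifier(random_state=1).fit(X, y)
importances = pd.Series(model.feature_importances_, index=feature_names)

# Keep the five most important features for a refit
top_features = importances.nlargest(5).index.tolist()
print(top_features)
```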

Model Justification

Below you can see one of the decision trees used by the random forest (its first estimator).

Show the code
from sklearn.tree import export_graphviz
import graphviz

estimator = model.estimators_[0]  # Inspect the first tree in the forest

# Export the decision tree to a DOT file
export_graphviz(estimator, out_file='tree.dot',
                feature_names=feature_cols,
                filled=True, rounded=True)

# Convert the DOT file to an image (PNG)
graphviz.render('dot', 'png', 'tree.dot')
'tree.dot.png'

Decision Tree

Compared to other models, the random forest was more accurate by about 0.02 (two percentage points). It relied heavily on livearea as a high-importance feature. We can theorize that livearea, the liveable area of a house, correlates with the year it was built because typical home sizes differ across building eras. In any case, the random forest classifier found livearea to be the most important feature, and it served the model’s performance well.
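The comparison can be reproduced directly by fitting a single decision tree and a random forest on the same split. A minimal sketch on synthetic data (the exact accuracy gap will differ from the dwellings results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with the same 80/20 split as the report
X, y = make_classification(n_samples=2000, n_features=13, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fit both models on identical training data and score on the same test set
tree_acc = accuracy_score(y_test, DecisionTreeClassifier(random_state=1).fit(X_train, y_train).predict(X_test))
forest_acc = accuracy_score(y_test, RandomForestClassifier(random_state=1).fit(X_train, y_train).predict(X_test))
print(f"Decision tree: {tree_acc:.3f}  Random forest: {forest_acc:.3f}")
```

Averaging many decorrelated trees typically reduces the variance of a single deep tree, which is why the forest usually edges it out.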

Performance

Below you can see the model’s accuracy, precision, and F1 score.

Show the code
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate precision
precision = metrics.precision_score(y_test, y_pred)
print("Precision:", precision)

# Calculate F1 score
f1 = metrics.f1_score(y_test, y_pred)
print("F1 Score:", f1)
Accuracy: 0.9105389482871482
Precision: 0.928521373510862
F1 Score: 0.9281961471103327

All three scores exceed 0.9, indicating the model is effective at making predictions.
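For readers unfamiliar with these metrics, they all derive from the confusion matrix. A tiny hand-worked example (the labels here are made up purely for illustration):

```python
import math
from sklearn.metrics import confusion_matrix, f1_score, precision_score

# Made-up labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                          # Of predicted positives, how many were right
recall = tp / (tp + fn)                             # Of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # Harmonic mean of precision and recall
print(precision, recall, f1)

# The hand computation matches scikit-learn's built-ins
assert math.isclose(precision, precision_score(y_true, y_pred))
assert math.isclose(f1, f1_score(y_true, y_pred))
```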

Conclusion

Surprisingly, the feature importances of the model did not align neatly with each feature’s linear correlation with year built. The random forest classification model performed well and was an effective fit for the task: its accuracy, precision, and F1 score were all higher than those of other models on the same dataset.